Natural language processing (NLP) has an ever-increasing importance in
biomedical informatics due to the exponential growth in research
publications \citep{Hunter:2006}.  NLP is essential for managing vast
amounts of unstructured text, facilitating information access and data
extraction that would be intractable as manual tasks.  A number of core NLP
technologies used in biomedical informatics could benefit from
knowledge of {\it verb subcategorization}, i.e.~the tendency of verbs
to ``select'' the syntactic phrase types they co-occur with: for
example, the fact that the verb {\it decrease} can be intransitive
({\it The contribution decreased}), while {\it compare} cannot ({\it
We compared the predictions}, but not simply {\it We
compared}). Technologies such as syntactic and semantic parsing, event
identification, relation extraction, and entailment detection all have
the potential to make use of subcategorization information.  For
example, \citet{ananiadou:10,rupp:10} used {\it subcategorization
frames} (SCFs) in event extraction from UK PubMed Central documents.

While subcategorization resources and techniques are relatively
well-developed for general
text \citep{brent:91,brent:93,ushioda:93,manning:93,briscoe:97,korhonen:02,valex,preiss:07},
this is not yet the case for biomedicine, and the field of biomedical
informatics currently lacks a baseline understanding of how well SCF
acquisition technology performs in different subdomains of
biomedicine, and where domain adaptation efforts should be
targeted.  Studies of the
lexical characteristics of biomedical subdomains 
\citep{lippincott:10,Lippincott:2011} have shown
substantial variation, both between general and biomedical text and
across subdomains, which points to a need for accurate, comprehensive,
automatically-acquired lexical resources, since it is impractical to develop them manually for each subdomain.
Automatic methods
also facilitate the gathering of statistical information on the frequency
of terms and linguistic contexts, which can be put to further use in
NLP systems.  So far, however, fully automatic methods are more common for
nouns than verbs \citep{yu:02,mccrae:08,widdows:06}, despite the fact that verbs
are central to recovering the meaning and structure of sentences, and to
discovering relations between biomedical entities.  
In addition, there has as yet been no standard evaluation for SCF acquisition in
biomedicine, a prerequisite for understanding how to move forward in
this area.

A small number of resources that have been built to support NLP in
biomedicine do contain verb SCF information, including
BioFrameNet \citep{dolbey:06} and the UMLS SPECIALIST
Lexicon \citep{mccray:94}, but the SCF information is manually
constructed.  The BioLexicon \citep{biolexicon,sasaki:08} is the only
such resource containing an automatically constructed SCF
lexicon. However, the BioLexicon includes data from the {\it E.~coli}
subdomain alone, and each component used in acquisition of the
BioLexicon -- for example, the part-of-speech tagger, named entity
recognizer, and parser -- has been manually adapted to the subdomain
of molecular biology. Moreover, the definitions of subcategorization
used in different resources are not always uniform and sometimes
differ from the definitions used for general language systems.  The
implications of these different definitions for biomedical SCF
acquisition have not previously been investigated.

Before the field can develop state-of-the-art SCF acquisition methods
for biomedicine, it is necessary to investigate a number of topics
which will help define what such a system should look like. In this
paper we explore the following questions: How does existing
subcategorization acquisition technology, which is either developed
for general language or adapted to biomedicine by means of
manually-built resources focusing on individual subdomains, perform in
the biomedical domain? Among the various
definitions of subcategorization, what are the implications for
biomedicine of choosing one over another? And is there meaningful variation in
subcategorization behavior across subdomains of biomedicine?  The goal
of the paper is to review the state of the art and suggest ways
forward with regard to the challenges in the automatic acquisition of
biomedical SCF lexicons.

To answer our questions we undertake investigations in three main
areas. First, we evaluate the performance of current SCF acquisition
technology on the biomedical domain.  For this purpose we manually
annotate an SCF gold standard, comprising 30 verbs with
subcategorization frames and frequencies, using data drawn from across
the PubMed Central Open Access subset (PMC OA) \citep{PMC:09}. This is
the first SCF gold standard for the biomedical domain in the
literature. We compare two automatically acquired SCF lexicons against
this gold standard.  The first is
the BioLexicon, which, as described above, was acquired using
components individually adapted to subdomains of biomedicine, and
applied to a corpus representing a particular subdomain. The second is BioVALEX, a new lexicon acquired using a system 
developed at the University of Cambridge \citep{preiss:07}, which was
previously used for general language, and which we have applied to a
large biomedical corpus without any further adaptation. This paper
represents the first such evaluation of biomedical SCF systems and
allows us to gain insight into how current technology performs and how
SCF information can best be acquired.

Second, we explore the meaning of subcategorization in biomedicine,
particularly in terms of the argument-adjunct distinction and the role
of highly selected adjuncts in biomedical subcategorization. We
manually annotate two SCF gold standards using two different
definitions of subcategorization which are common in NLP, to determine
their impact on the overall shape of the gold standard and the
accuracy of a subcategorization acquisition system.

Finally, we consider how SCFs vary at the level of biomedical
subdomains.  We use the Cambridge system to acquire BioVALEX from the
PMC OA corpus, the largest corpus used to date in automatic SCF
acquisition.\footnote{Although e.g.~the
SPECIALIST lexicon has broad coverage of subdomains, it is less
comprehensive due to being manually compiled.} We use this new lexicon
to explore subdomain variation in biomedicine by measuring the
difference in subcategorization behavior across subdomains, providing
a new perspective on subdomain variation. We present a detailed
picture of subdomain variation in the behavior of six representative
verbs. The complete, unfiltered SCF lexicon acquired for this investigation will
be made publicly available.
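To make the notion of measuring subdomain difference concrete, one common way to compare two SCF frequency distributions for the same verb is a symmetric divergence measure such as Jensen--Shannon divergence. The sketch below is purely illustrative: the frame labels, subdomain names, and frequencies are invented, and this is not necessarily the measure used in our experiments.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2) between two discrete
    distributions given as {frame: probability} dicts."""
    frames = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2
    m = {f: 0.5 * (p.get(f, 0.0) + q.get(f, 0.0)) for f in frames}
    def kl(a):
        # KL(a || m); terms with a(f) = 0 contribute nothing
        return sum(a.get(f, 0.0) * math.log2(a.get(f, 0.0) / m[f])
                   for f in frames if a.get(f, 0.0) > 0.0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Hypothetical SCF distributions for one verb in two subdomains
# (all labels and numbers are invented for illustration).
genetics   = {"NP": 0.6, "NP_PP": 0.3, "NP_TO-INF": 0.1}
cardiology = {"NP": 0.4, "NP_PP": 0.5, "NP_TO-INF": 0.1}
print(round(js_divergence(genetics, cardiology), 3))
```

Identical distributions score 0, and with base-2 logarithms the divergence is bounded above by 1, which makes scores comparable across verbs and subdomain pairs.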

Overall, our investigation seeks to provide the field of biomedical
information processing with a much-needed baseline representing the
current state of the art in SCF acquisition for biomedical text.
Previous work with large datasets for SCF acquisition has focused on
general language, so this investigation contributes towards our
knowledge of how to build domain-specific systems. We find that 
existing SCF systems suffer performance degradation in
biomedicine compared to general language, with the more
labor-intensive manually adapted system suffering most in recall, and
the unadapted general language system applied to biomedical text
suffering most in precision. 
In addition, we find that 
the
treatment of the argument-adjunct distinction has a major effect on
the ultimate shape of the resulting lexicon, and consequently on
measured performance of SCF acquisition systems. 
Finally, we find major variation between
SCF behavior in biomedical subdomains.  Taken together, these points
suggest the need for domain adaptation, potentially involving
minimally supervised approaches, and at a finer-grained level than
``biomedical text''.


